The dataset comprises information on 1000 customers, with 84 features derived from their financial transactions and current financial standing. The primary objective is to leverage this dataset for credit risk estimation and predicting potential defaults.
CUST_ID: Unique customer identifier Key Target Variables:
CREDIT_SCORE: Numerical target variable representing the customer's credit score (integer)
DEFAULT: Binary target variable indicating if the customer has defaulted (1) or not (0) Description of Features:
INCOME: Total income in the last 12 months
R_[GROUP]DEBT: Ratio of T[GROUP]_12 to DEBT ### Categorical Features:
CAT_GAMBLING: Gambling category (none, low, high)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')
np.random.seed = 42
df=pd.read_csv('../../data/credit_score.csv')
# take without first column and last
X=df.iloc[:,1:-1]
y=df.iloc[:,-1]
df[["CUST_ID"]]
| CUST_ID | |
|---|---|
| 0 | C02COQEVYU |
| 1 | C02OZKC0ZF |
| 2 | C03FHP2D0A |
| 3 | C03PVPPHOY |
| 4 | C04J69MUX0 |
| ... | ... |
| 995 | CZQHJC9HDH |
| 996 | CZRA4MLB0P |
| 997 | CZSOD1KVFX |
| 998 | CZWC76UAUT |
| 999 | CZZV5B3SAL |
1000 rows × 1 columns
We will remove ID column, because it is not useful for our model.
X
| INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_INCOME | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33269 | 0 | 532304 | 0.0000 | 16.0000 | 1.2000 | 1889 | 945 | 0.5003 | 0.0568 | ... | 1.0000 | 0.0000 | 0.0625 | High | 1 | 0 | 0 | 0 | 0 | 444 |
| 1 | 77158 | 91187 | 315648 | 1.1818 | 4.0909 | 3.4615 | 5818 | 111 | 0.0191 | 0.0754 | ... | 0.9091 | 0.7692 | 0.2222 | No | 1 | 0 | 0 | 1 | 0 | 625 |
| 2 | 30917 | 21642 | 534864 | 0.7000 | 17.3000 | 24.7142 | 1157 | 860 | 0.7433 | 0.0374 | ... | 1.0000 | 1.4286 | 0.0578 | High | 1 | 0 | 0 | 1 | 0 | 469 |
| 3 | 80657 | 64526 | 629125 | 0.8000 | 7.8000 | 9.7499 | 6857 | 3686 | 0.5376 | 0.0850 | ... | 1.0000 | 1.2500 | 0.1282 | High | 1 | 0 | 0 | 1 | 0 | 559 |
| 4 | 149971 | 1172498 | 2399531 | 7.8182 | 16.0000 | 2.0465 | 1978 | 322 | 0.1628 | 0.0132 | ... | 0.9091 | 0.1163 | 0.0568 | High | 1 | 1 | 1 | 1 | 1 | 473 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 328892 | 1465066 | 5501471 | 4.4546 | 16.7273 | 3.7551 | 16701 | 10132 | 0.6067 | 0.0508 | ... | 0.9091 | 0.2041 | 0.0543 | High | 1 | 1 | 1 | 1 | 1 | 418 |
| 996 | 81404 | 88805 | 680837 | 1.0909 | 8.3637 | 7.6667 | 5400 | 1936 | 0.3585 | 0.0663 | ... | 0.9091 | 0.8333 | 0.1087 | No | 1 | 0 | 0 | 1 | 0 | 589 |
| 997 | 0 | 42428 | 30760 | 3.2379 | 8.1889 | 0.7250 | 0 | 0 | 0.8779 | 0.0047 | ... | 1.0668 | 0.2500 | 0.3448 | No | 1 | 0 | 0 | 1 | 0 | 499 |
| 998 | 36011 | 8002 | 604181 | 0.2222 | 16.7777 | 75.5037 | 1993 | 1271 | 0.6377 | 0.0553 | ... | 1.1111 | 5.0002 | 0.0662 | No | 1 | 1 | 0 | 1 | 0 | 507 |
| 999 | 44266 | 309859 | 44266 | 6.9999 | 1.0000 | 0.1429 | 1574 | 1264 | 0.8030 | 0.0356 | ... | 1.1111 | 0.1587 | 1.1111 | No | 1 | 0 | 0 | 1 | 0 | 657 |
1000 rows × 85 columns
y
0 1
1 0
2 1
3 0
4 0
..
995 0
996 1
997 0
998 0
999 0
Name: DEFAULT, Length: 1000, dtype: int64
# Split whole dataset into two parts: for modellers and for validators (70% and 30%)
X_mod, X_val, y_mod, y_val = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
print(X_mod.shape, X_val.shape, y_mod.shape, y_val.shape)
(700, 85) (300, 85) (700,) (300,)
df_validators=pd.concat([X_val, y_val], axis=1)
df_validators.to_csv('../../data/for_validators/credit_score_validators.csv', index=False)
X_train,X_valid,y_train,y_valid = train_test_split(X_mod, y_mod, test_size=0.3, random_state=42, stratify=y_mod)
print(X_train.shape, X_valid.shape, y_train.shape, y_valid.shape)
(490, 85) (210, 85) (490,) (210,)
df_modellers=pd.concat([X_train, y_train], axis=1)
#df_modellers
df_modellers.to_csv('../../data/for_modelling/credit_score_train.csv', index=False)
# Split validation part from the first part into two parts: for validation and for testing (66% and 34%)
X_validation,X_test,y_validation,y_test = train_test_split(X_valid, y_valid, test_size=0.34, random_state=42, stratify=y_valid)
print(X_validation.shape, X_test.shape, y_validation.shape, y_test.shape)
(138, 85) (72, 85) (138,) (72,)
df_valid=pd.concat([X_validation, y_validation], axis=1)
df_valid.to_csv('../../data/for_modelling/credit_score_valid.csv', index=False)
df_test=pd.concat([X_test, y_test], axis=1)
df_test.to_csv('../../data/for_modelling/credit_score_test.csv', index=False)
Let's move to EDA
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib.pyplot as plt
df_train = pd.read_csv('../../data/for_modelling/credit_score_train.csv')
df_val = pd.read_csv('../../data/for_modelling/credit_score_valid.csv')
df=pd.concat([df_train, df_val])
df.head()
| INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2783 | 1855 | 0 | 0.6665 | 0.00 | 0.0000 | 103 | 74 | 0.7184 | 0.0370 | ... | 2.5003 | 0.0000 | No | 0 | 0 | 0 | 1 | 0 | 570 | 0 |
| 1 | 314430 | 445442 | 707468 | 1.4167 | 2.25 | 1.5882 | 35861 | 29157 | 0.8131 | 0.1141 | ... | 0.5882 | 0.3704 | High | 1 | 0 | 1 | 1 | 0 | 691 | 0 |
| 2 | 161773 | 517674 | 2782496 | 3.2000 | 17.20 | 5.3750 | 3716 | 2533 | 0.6816 | 0.0230 | ... | 0.3125 | 0.0581 | No | 1 | 1 | 1 | 1 | 1 | 520 | 0 |
| 3 | 16014 | 97685 | 20818 | 6.1000 | 1.30 | 0.2131 | 637 | 187 | 0.2936 | 0.0398 | ... | 0.1639 | 0.7692 | No | 1 | 0 | 0 | 1 | 0 | 654 | 0 |
| 4 | 193225 | 1410542 | 2589215 | 7.3000 | 13.40 | 1.8356 | 5276 | 2325 | 0.4407 | 0.0273 | ... | 0.1370 | 0.0746 | No | 1 | 1 | 0 | 1 | 1 | 552 | 0 |
5 rows × 86 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 628 entries, 0 to 137 Data columns (total 86 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 INCOME 628 non-null int64 1 SAVINGS 628 non-null int64 2 DEBT 628 non-null int64 3 R_SAVINGS_INCOME 628 non-null float64 4 R_DEBT_INCOME 628 non-null float64 5 R_DEBT_SAVINGS 628 non-null float64 6 T_CLOTHING_12 628 non-null int64 7 T_CLOTHING_6 628 non-null int64 8 R_CLOTHING 628 non-null float64 9 R_CLOTHING_INCOME 628 non-null float64 10 R_CLOTHING_SAVINGS 628 non-null float64 11 R_CLOTHING_DEBT 628 non-null float64 12 T_EDUCATION_12 628 non-null int64 13 T_EDUCATION_6 628 non-null int64 14 R_EDUCATION 628 non-null float64 15 R_EDUCATION_INCOME 628 non-null float64 16 R_EDUCATION_SAVINGS 628 non-null float64 17 R_EDUCATION_DEBT 628 non-null float64 18 T_ENTERTAINMENT_12 628 non-null int64 19 T_ENTERTAINMENT_6 628 non-null int64 20 R_ENTERTAINMENT 628 non-null float64 21 R_ENTERTAINMENT_INCOME 628 non-null float64 22 R_ENTERTAINMENT_SAVINGS 628 non-null float64 23 R_ENTERTAINMENT_DEBT 628 non-null float64 24 T_FINES_12 628 non-null int64 25 T_FINES_6 628 non-null int64 26 R_FINES 628 non-null float64 27 R_FINES_INCOME 628 non-null float64 28 R_FINES_SAVINGS 628 non-null float64 29 R_FINES_DEBT 628 non-null float64 30 T_GAMBLING_12 628 non-null int64 31 T_GAMBLING_6 628 non-null int64 32 R_GAMBLING 628 non-null float64 33 R_GAMBLING_INCOME 628 non-null float64 34 R_GAMBLING_SAVINGS 628 non-null float64 35 R_GAMBLING_DEBT 628 non-null float64 36 T_GROCERIES_12 628 non-null int64 37 T_GROCERIES_6 628 non-null int64 38 R_GROCERIES 628 non-null float64 39 R_GROCERIES_INCOME 628 non-null float64 40 R_GROCERIES_SAVINGS 628 non-null float64 41 R_GROCERIES_DEBT 628 non-null float64 42 T_HEALTH_12 628 non-null int64 43 T_HEALTH_6 628 non-null int64 44 R_HEALTH 628 non-null float64 45 R_HEALTH_INCOME 628 non-null float64 46 R_HEALTH_SAVINGS 628 non-null float64 47 R_HEALTH_DEBT 628 non-null float64 48 T_HOUSING_12 628 non-null int64 49 T_HOUSING_6 628 non-null int64 50 R_HOUSING 628 non-null float64 51 R_HOUSING_INCOME 628 non-null float64 52 R_HOUSING_SAVINGS 628 non-null float64 53 R_HOUSING_DEBT 628 non-null float64 54 T_TAX_12 628 non-null int64 55 T_TAX_6 628 non-null int64 56 R_TAX 628 non-null float64 57 R_TAX_INCOME 628 non-null float64 58 R_TAX_SAVINGS 628 non-null float64 59 R_TAX_DEBT 628 non-null float64 60 T_TRAVEL_12 628 non-null int64 61 T_TRAVEL_6 628 non-null int64 62 R_TRAVEL 628 non-null float64 63 R_TRAVEL_INCOME 628 non-null float64 64 R_TRAVEL_SAVINGS 628 non-null float64 65 R_TRAVEL_DEBT 628 non-null float64 66 T_UTILITIES_12 628 non-null int64 67 T_UTILITIES_6 628 non-null int64 68 R_UTILITIES 628 non-null float64 69 R_UTILITIES_INCOME 628 non-null float64 70 R_UTILITIES_SAVINGS 628 non-null float64 71 R_UTILITIES_DEBT 628 non-null float64 72 T_EXPENDITURE_12 628 non-null int64 73 T_EXPENDITURE_6 628 non-null int64 74 R_EXPENDITURE 628 non-null float64 75 R_EXPENDITURE_INCOME 628 non-null float64 76 R_EXPENDITURE_SAVINGS 628 non-null float64 77 R_EXPENDITURE_DEBT 628 non-null float64 78 CAT_GAMBLING 628 non-null object 79 CAT_DEBT 628 non-null int64 80 CAT_CREDIT_CARD 628 non-null int64 81 CAT_MORTGAGE 628 non-null int64 82 CAT_SAVINGS_ACCOUNT 628 non-null int64 83 CAT_DEPENDENTS 628 non-null int64 84 CREDIT_SCORE 628 non-null int64 85 DEFAULT 628 non-null int64 dtypes: float64(51), int64(34), object(1) memory usage: 426.8+ KB
#how many categorical and numerical columns
cat_cols = df.select_dtypes(include='object').columns
num_cols = df.select_dtypes(include='number').columns
print(f'categorical columns: {cat_cols}')
print(f'numerical columns: {num_cols}')
print(f'number of categorical columns: {len(cat_cols)}')
print(f'number of numerical columns: {len(num_cols)}')
categorical columns: Index(['CAT_GAMBLING'], dtype='object')
numerical columns: Index(['INCOME', 'SAVINGS', 'DEBT', 'R_SAVINGS_INCOME', 'R_DEBT_INCOME',
'R_DEBT_SAVINGS', 'T_CLOTHING_12', 'T_CLOTHING_6', 'R_CLOTHING',
'R_CLOTHING_INCOME', 'R_CLOTHING_SAVINGS', 'R_CLOTHING_DEBT',
'T_EDUCATION_12', 'T_EDUCATION_6', 'R_EDUCATION', 'R_EDUCATION_INCOME',
'R_EDUCATION_SAVINGS', 'R_EDUCATION_DEBT', 'T_ENTERTAINMENT_12',
'T_ENTERTAINMENT_6', 'R_ENTERTAINMENT', 'R_ENTERTAINMENT_INCOME',
'R_ENTERTAINMENT_SAVINGS', 'R_ENTERTAINMENT_DEBT', 'T_FINES_12',
'T_FINES_6', 'R_FINES', 'R_FINES_INCOME', 'R_FINES_SAVINGS',
'R_FINES_DEBT', 'T_GAMBLING_12', 'T_GAMBLING_6', 'R_GAMBLING',
'R_GAMBLING_INCOME', 'R_GAMBLING_SAVINGS', 'R_GAMBLING_DEBT',
'T_GROCERIES_12', 'T_GROCERIES_6', 'R_GROCERIES', 'R_GROCERIES_INCOME',
'R_GROCERIES_SAVINGS', 'R_GROCERIES_DEBT', 'T_HEALTH_12', 'T_HEALTH_6',
'R_HEALTH', 'R_HEALTH_INCOME', 'R_HEALTH_SAVINGS', 'R_HEALTH_DEBT',
'T_HOUSING_12', 'T_HOUSING_6', 'R_HOUSING', 'R_HOUSING_INCOME',
'R_HOUSING_SAVINGS', 'R_HOUSING_DEBT', 'T_TAX_12', 'T_TAX_6', 'R_TAX',
'R_TAX_INCOME', 'R_TAX_SAVINGS', 'R_TAX_DEBT', 'T_TRAVEL_12',
'T_TRAVEL_6', 'R_TRAVEL', 'R_TRAVEL_INCOME', 'R_TRAVEL_SAVINGS',
'R_TRAVEL_DEBT', 'T_UTILITIES_12', 'T_UTILITIES_6', 'R_UTILITIES',
'R_UTILITIES_INCOME', 'R_UTILITIES_SAVINGS', 'R_UTILITIES_DEBT',
'T_EXPENDITURE_12', 'T_EXPENDITURE_6', 'R_EXPENDITURE',
'R_EXPENDITURE_INCOME', 'R_EXPENDITURE_SAVINGS', 'R_EXPENDITURE_DEBT',
'CAT_DEBT', 'CAT_CREDIT_CARD', 'CAT_MORTGAGE', 'CAT_SAVINGS_ACCOUNT',
'CAT_DEPENDENTS', 'CREDIT_SCORE', 'DEFAULT'],
dtype='object')
number of categorical columns: 1
number of numerical columns: 85
#check CAT_GAMBLING values
df['CAT_GAMBLING'].value_counts()
CAT_GAMBLING No 392 High 163 Low 73 Name: count, dtype: int64
df.describe()
| INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_INCOME | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 628.000000 | 6.280000e+02 | 6.280000e+02 | 628.000000 | 628.000000 | 628.000000 | 628.000000 | 628.000000 | 628.000000 | 628.000000 | ... | 628.000000 | 628.000000 | 628.000000 | 628.000000 | 628.000000 | 628.00000 | 628.000000 | 628.000000 | 628.000000 | 628.000000 |
| mean | 122852.219745 | 4.234954e+05 | 8.315110e+05 | 4.251653 | 6.050788 | 5.730516 | 6793.633758 | 3445.570064 | 0.461276 | 0.054612 | ... | 0.939313 | 0.897402 | 0.590555 | 0.944268 | 0.243631 | 0.18949 | 0.995223 | 0.171975 | 587.060510 | 0.283439 |
| std | 111735.273010 | 4.459637e+05 | 1.013800e+06 | 4.045379 | 5.793431 | 15.021525 | 7456.934234 | 5146.191291 | 0.236514 | 0.035962 | ... | 0.165736 | 1.597915 | 1.269410 | 0.229587 | 0.429614 | 0.39221 | 0.069006 | 0.377659 | 62.146964 | 0.451027 |
| min | 0.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.003300 | 0.003700 | ... | 0.666700 | 0.067600 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 300.000000 | 0.000000 |
| 25% | 30494.500000 | 7.529625e+04 | 4.978375e+04 | 1.100000 | 1.500000 | 0.203100 | 1127.000000 | 295.000000 | 0.269450 | 0.029675 | ... | 0.833300 | 0.148750 | 0.099000 | 1.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 | 555.000000 | 0.000000 |
| 50% | 86994.000000 | 2.946345e+05 | 4.056500e+05 | 2.713650 | 4.883350 | 2.000000 | 4380.000000 | 1319.500000 | 0.475500 | 0.045550 | ... | 0.909100 | 0.303000 | 0.178600 | 1.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 | 593.500000 | 0.000000 |
| 75% | 183004.500000 | 6.272762e+05 | 1.329519e+06 | 6.999925 | 8.725000 | 4.540175 | 10080.750000 | 4638.000000 | 0.630000 | 0.067200 | ... | 1.000000 | 0.833300 | 0.588225 | 1.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 | 630.000000 | 1.000000 |
| max | 545667.000000 | 2.738164e+06 | 5.968620e+06 | 16.111200 | 37.000600 | 166.978800 | 43255.000000 | 39918.000000 | 1.058300 | 0.241300 | ... | 2.000000 | 10.009900 | 10.005300 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 | 800.000000 | 1.000000 |
8 rows × 85 columns
target='DEFAULT'
df[target].value_counts()
DEFAULT 0 450 1 178 Name: count, dtype: int64
# encoding categorical variable CAT_GAMBLING to numerical
df['CAT_GAMBLING'] = df['CAT_GAMBLING'].map({'No': 0, 'Low': 1,'High': 2})
df
| INCOME | SAVINGS | DEBT | R_SAVINGS_INCOME | R_DEBT_INCOME | R_DEBT_SAVINGS | T_CLOTHING_12 | T_CLOTHING_6 | R_CLOTHING | R_CLOTHING_INCOME | ... | R_EXPENDITURE_SAVINGS | R_EXPENDITURE_DEBT | CAT_GAMBLING | CAT_DEBT | CAT_CREDIT_CARD | CAT_MORTGAGE | CAT_SAVINGS_ACCOUNT | CAT_DEPENDENTS | CREDIT_SCORE | DEFAULT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2783 | 1855 | 0 | 0.6665 | 0.0000 | 0.0000 | 103 | 74 | 0.7184 | 0.0370 | ... | 2.5003 | 0.0000 | 0 | 0 | 0 | 0 | 1 | 0 | 570 | 0 |
| 1 | 314430 | 445442 | 707468 | 1.4167 | 2.2500 | 1.5882 | 35861 | 29157 | 0.8131 | 0.1141 | ... | 0.5882 | 0.3704 | 2 | 1 | 0 | 1 | 1 | 0 | 691 | 0 |
| 2 | 161773 | 517674 | 2782496 | 3.2000 | 17.2000 | 5.3750 | 3716 | 2533 | 0.6816 | 0.0230 | ... | 0.3125 | 0.0581 | 0 | 1 | 1 | 1 | 1 | 1 | 520 | 0 |
| 3 | 16014 | 97685 | 20818 | 6.1000 | 1.3000 | 0.2131 | 637 | 187 | 0.2936 | 0.0398 | ... | 0.1639 | 0.7692 | 0 | 1 | 0 | 0 | 1 | 0 | 654 | 0 |
| 4 | 193225 | 1410542 | 2589215 | 7.3000 | 13.4000 | 1.8356 | 5276 | 2325 | 0.4407 | 0.0273 | ... | 0.1370 | 0.0746 | 0 | 1 | 1 | 0 | 1 | 1 | 552 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 133 | 386976 | 1250231 | 654883 | 3.2308 | 1.6923 | 0.5238 | 36665 | 30569 | 0.8337 | 0.0947 | ... | 0.2381 | 0.4545 | 0 | 1 | 0 | 0 | 1 | 0 | 562 | 0 |
| 134 | 16772 | 8386 | 0 | 0.5000 | 0.0000 | 0.0000 | 238 | 186 | 0.7815 | 0.0142 | ... | 2.5000 | 0.9786 | 0 | 0 | 0 | 0 | 1 | 0 | 619 | 0 |
| 135 | 137509 | 206264 | 206264 | 1.5000 | 1.5000 | 1.0000 | 15661 | 10865 | 0.6938 | 0.1139 | ... | 0.6667 | 0.6667 | 0 | 1 | 0 | 0 | 1 | 0 | 639 | 1 |
| 136 | 259146 | 1922000 | 1662854 | 7.4167 | 6.4167 | 0.8652 | 12180 | 8527 | 0.7001 | 0.0470 | ... | 0.1124 | 0.1299 | 2 | 1 | 1 | 0 | 1 | 1 | 587 | 1 |
| 137 | 56657 | 442952 | 51506 | 7.8181 | 0.9091 | 0.1163 | 1085 | 289 | 0.2664 | 0.0192 | ... | 0.1163 | 1.0000 | 0 | 1 | 0 | 0 | 1 | 0 | 635 | 1 |
628 rows × 86 columns
We have encoded CAT_GAMBLING
df_without_cat = df.drop(columns=['CAT_GAMBLING'])
On the advice of the validators, we remove categorical variables from the correlation matrix.
# correlation matrix
corr = df_without_cat.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=False, cmap='coolwarm', fmt=".2f")
plt.show()
high_correlation_matrix= corr[abs(corr)>0.7]
#plot
plt.figure(figsize=(20, 20))
sns.heatmap(high_correlation_matrix, annot=False, cmap='coolwarm', fmt=".2f")
plt.show()
We have 3 similar columns
We might remove 2 of them but 100% sure that we will remove 1 of them
R_DEBT_INCOME and CREDIT_SCORE are highly correlated.
correlation_matrix = df.corr()
plt.figure(figsize=(12, 20))
sns.heatmap(correlation_matrix[['DEFAULT','CREDIT_SCORE']].sort_values(by='DEFAULT', ascending=False), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation with CREDIT_SCORE, DEFAULT and other variables')
plt.show()
There is a clear negative correlation between CREDIT_SCORE and DEFAULT. This indicates that lower credit scores are associated with a higher risk of default. This result is expected, as the credit score is an indicator of an individual's creditworthiness
The variable R_DEBT_INCOME shows the highest positive correlation with DEFAULT, suggesting that a higher debt-to-income ratio may increase the risk of default. This is understandable, as a higher debt-to-income ratio may indicate a greater financial burden relative to a person's earning capacity. However, we can also see that this variable has the highest negative correlation with Credit Score, which suggests that this will be a very important variable in predicting default and credit score
Variables like CAT_MORTGAGE, CAT_SAVINGS_ACCOUNT and those related to expenses (such as T_GROCERIES_12, T_GAMBLING_12, T_UTILITIES_6) show low correlation both with CREDIT_SCORE and DEFAULT, which might suggest that they do not significantly affect credit risk and default
# show DEBT and CAT_DEBT column
df[['DEBT', 'CAT_DEBT']]
| DEBT | CAT_DEBT | |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 707468 | 1 |
| 2 | 2782496 | 1 |
| 3 | 20818 | 1 |
| 4 | 2589215 | 1 |
| ... | ... | ... |
| 133 | 654883 | 1 |
| 134 | 0 | 0 |
| 135 | 206264 | 1 |
| 136 | 1662854 | 1 |
| 137 | 51506 | 1 |
628 rows × 2 columns
# zobaczmy rozklad wszystkich wartości, oprócz tych gdzie wartości są z przedziału tylko 0 lub 1
columns_to_plot = [col for col in df.columns if not df[col].nunique() in [1, 2]]
df[columns_to_plot].hist(bins=40, figsize=(20, 20))
plt.tight_layout()
plt.show()
FEATURES THAT HAVE NORMAL DISTRIBUTION
CANDIDATES - FEATURES THAT CAN BE TRANSFORMED TO NORMAL DISTRIBUTION PROBABLY
INTERESTING FEATURES IDK WHAT TO DO WITH THEM
# WE ARE SHOWING SCATTERPLOT FOR ALL FEATURES EXCEPT DEFAULT
fig, axs = plt.subplots(28, 3, figsize=(16, 100))
axs = axs.flatten()
for i, col in enumerate(df.drop(columns=['CREDIT_SCORE']).columns):
sns.scatterplot(data=df, x=col, y=df["CREDIT_SCORE"], ax=axs[i], s=15, hue='DEFAULT')
plt.suptitle("Target variable vs. features")
plt.tight_layout()
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) Cell In[26], line 5 3 axs = axs.flatten() 4 for i, col in enumerate(df.drop(columns=['CREDIT_SCORE']).columns): ----> 5 sns.scatterplot(data=df, x=col, y=df["CREDIT_SCORE"], ax=axs[i], s=15, hue='DEFAULT') 6 plt.suptitle("Target variable vs. features") 7 plt.tight_layout() IndexError: index 84 is out of bounds for axis 0 with size 84
(MY DOPISZEMY)
sns.boxplot(x='CAT_GAMBLING', y='CREDIT_SCORE', data=df)
plt.title('Boxplot of CREDIT_SCORE by CAT_GAMBLING')
plt.xlabel('CAT_GAMBLING')
plt.ylabel('CREDIT_SCORE')
plt.show()
On the advice of the validators we make a boxplot for the categorical data. We can see that credit score is mostly lower, when cat_gambling is high.
# Wybieranie kolumn, których nazwy zaczynają się od "CAT"
cat_columns = df.filter(regex=r'^CAT')
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
# Tworzenie wykresu słupkowego dla każdej kolumny z "CAT" w zależności od "CREDIT_SCORE"
for i, column in enumerate(cat_columns):
row_index = i // 3 # Indeks wiersza
col_index = i % 3 # Indeks kolumny
ax = axes[row_index, col_index]
df.groupby(column)['CREDIT_SCORE'].mean().plot(kind='bar', ax=ax, title=f'Average CREDIT_SCORE by {column}')
ax.set_xlabel(column)
ax.set_ylabel('Average CREDIT_SCORE')
ax.tick_params(axis='x', rotation=0)
plt.tight_layout()
plt.show()
,,binary variables could be examined with bar charts' ~validators:) So we did it. summary:
average_default = df.groupby('CAT_GAMBLING')['DEFAULT'].mean()
print(average_default)
CAT_GAMBLING 0 0.260204 1 0.273973 2 0.343558 Name: DEFAULT, dtype: float64
The higher 'CAT_GAMBLING' the higher 'DEFAULT' (so higher chance of not paying your debt)
# show my only R_Fines_SAvings column
df[['R_FINES_SAVINGS']]
| R_FINES_SAVINGS | |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
| ... | ... |
| 133 | 0.0 |
| 134 | 0.0 |
| 135 | 0.0 |
| 136 | 0.0 |
| 137 | 0.0 |
628 rows × 1 columns
Let's look closer at income
bins = [0, 10000, 30000, 60000, 100000, 150000, 662094]
# Obliczanie koszyków automatycznie
df['income_cat'] = pd.cut(df['INCOME'], bins=bins)
average_default = df.groupby('income_cat')['DEFAULT'].mean()
print(average_default)
income_cat (0, 10000] 0.153846 (10000, 30000] 0.336735 (30000, 60000] 0.222222 (60000, 100000] 0.259740 (100000, 150000] 0.282353 (150000, 662094] 0.305164 Name: DEFAULT, dtype: float64
It seems that income doesn't have much influence on repaying the loan, but We didn't consider the possibility that the more someone earns, the larger loan they may take. It might be the case that smaller loans are more often repaid by those who earn more, but those who earn less don't take out large loans, which possibly might be repaid "worse."
average_default = df.groupby('DEFAULT')['INCOME'].mean()
print(average_default)
DEFAULT 0 119266.653333 1 131916.853933 Name: INCOME, dtype: float64
Exactly :)
average_default = df.groupby('DEFAULT')['SAVINGS'].mean()
print(average_default)
DEFAULT 0 418978.322222 1 434914.865169 Name: SAVINGS, dtype: float64
It's the same with savings.
# Let's look closer at INCOME
low_debt = df[df['DEBT'] <= 5.396675e+04]
average_default = low_debt.groupby('income_cat')['DEFAULT'].mean()
print(average_default)
average_debt = df[(df['DEBT'] > 5.396675e+04) & (df['DEBT'] <= 3.950955e+05)]
average_default2 = average_debt.groupby('income_cat')['DEFAULT'].mean()
print(average_default2)
high_debt = df[(df['DEBT'] > 3.950955e+05) & (df['DEBT'] <= 1.193230e+06)]
average_default2 = high_debt.groupby('income_cat')['DEFAULT'].mean()
print(average_default2)
highest_debt = df[df['DEBT'] > 1.193230e+06]
average_default2 = highest_debt.groupby('income_cat')['DEFAULT'].mean()
print(average_default2)
income_cat (0, 10000] 0.181818 (10000, 30000] 0.200000 (30000, 60000] 0.159091 (60000, 100000] 0.000000 (100000, 150000] NaN (150000, 662094] NaN Name: DEFAULT, dtype: float64 income_cat (0, 10000] 0.000000 (10000, 30000] 0.400000 (30000, 60000] 0.266667 (60000, 100000] 0.219512 (100000, 150000] 0.157895 (150000, 662094] 0.111111 Name: DEFAULT, dtype: float64 income_cat (0, 10000] NaN (10000, 30000] 0.923077 (30000, 60000] 0.300000 (60000, 100000] 0.333333 (100000, 150000] 0.238095 (150000, 662094] 0.193548 Name: DEFAULT, dtype: float64 income_cat (0, 10000] NaN (10000, 30000] NaN (30000, 60000] NaN (60000, 100000] 1.000000 (100000, 150000] 0.458333 (150000, 662094] 0.366197 Name: DEFAULT, dtype: float64
As we suspected, individuals with low income don't take out very large loans. With the division into ranges of loan values, it's already noticeable that income has a significant impact on loan repayment: the lower the income, the more unpaid loans there are.
We will build a model to check importance of features
df_train = pd.read_csv('../../data/for_modelling/credit_score_train.csv')
df_val = pd.read_csv('../../data/for_modelling/credit_score_valid.csv')
df=pd.concat([df_train, df_val])
#map CAT_GAMBLING to numerical
df['CAT_GAMBLING'] = df['CAT_GAMBLING'].map({'No': 0, 'Low': 1,'High': 2})
X=df.drop(columns=['DEFAULT'])
y=df['DEFAULT']
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importances = model.feature_importances_
features = X.columns
forest_importances = pd.Series(importances, index=features).sort_values(ascending=False)
forest_importances
CREDIT_SCORE 0.042118
R_DEBT_INCOME 0.026159
R_TAX_DEBT 0.024946
R_UTILITIES_DEBT 0.024116
R_ENTERTAINMENT_DEBT 0.021476
...
CAT_GAMBLING 0.000949
CAT_MORTGAGE 0.000904
CAT_DEBT 0.000376
CAT_DEPENDENTS 0.000259
CAT_SAVINGS_ACCOUNT 0.000000
Length: 85, dtype: float64
plt.figure(figsize=(20, 20))
forest_importances.plot(kind='barh')
<Axes: >
According to this model we could remove: